Abstract

Mean field is an efficient way to approximate a posterior distribution in complex graphical
models and constitutes the most popular class of Bayesian variational approximation
methods. In most applications, the mean field distribution parameters are computed using
an alternate coordinate minimization. However, the convergence properties of this algorithm
remain unclear. In this paper, we show how, by adding an appropriate penalization
term, we can guarantee convergence to a critical point, while keeping a closed form update
at each step. A convergence rate estimate can also be derived based on recent results in
non-convex optimization.

1. Introduction 

In many situations when a posterior distribution P over variables X depends on a complex 
model, exact inference is not possible and variational inference (Wainwright and Jordan, 
2008; Attias, 2000) is a widespread approach to approximating it. This technique is used 
in domains such as Computer Vision, Natural Language Processing, and large scale Data 
Processing, as in Fleuret et al. (2008); Hu et al. (2014); Ishigaki et al. (2014).

Mean field variational inference methods approximate P by a distribution Q taken from a
restricted class of product distributions. The quality of the approximation is measured
in terms of the Kullback-Leibler divergence between P and Q, which turns the mean field
problem into a non-convex minimization problem.

The most popular approach to solving it is the alternate minimization approach (Bishop,
2008; Koller and Friedman, 2009), also known as the Variational Message Passing algorithm
(Winn and Bishop, 2005) in the machine learning community. The Kullback-Leibler
divergence is minimized coordinate by coordinate in a pre-determined order, until convergence.
The main advantage of this algorithm is that the coordinate-wise minimum can be
computed in closed form at each step. Furthermore, the procedure can be parallelized in
most cases, as shown in Bertsekas and Tsitsiklis (1997, p. 21).

However, convergence is not always guaranteed for general alternate minimisation
in the non-convex case. One can find examples where the procedure endlessly loops between
several equivalent local minima, which become cluster points of the minimization
sequence, as shown in Powell (1973). Convergence can be proven in some
cases (Tseng and Mangasarian, 2001) but not all. More precisely, the objective function
always decreases, but that does not preclude oscillations in the variables, and there is no
formal proof that the alternating minimization algorithm for variational inference will never
loop as in the Powell example.

Our contribution is the introduction of a special purpose proximal regularisation term
at each step of the minimization that provably enforces convergence. It dampens potential
oscillations while preserving the simplicity of the algorithm, whose updates are still
computed in closed form. We use a recent result from Attouch et al. (2013) to prove formally
that this is indeed the case.

It is important to understand that, as the objective function is non-convex, our proximal
algorithm does not always converge to the same minimum as the classical fixed point
algorithm. However, the solution found has no reason to be better or worse. Furthermore,
the proximal term can be chosen arbitrarily small through the parameter $\lambda$. Therefore, by
choosing a small $\lambda$, one can make the new proximal algorithm follow a trajectory which
is arbitrarily close to the trajectory of the alternate minimisation.

Table 1: Notations

• $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^N$.

• For a differentiable function $f$, $\nabla f$ is its gradient.

• $Q = \{q_1, \dots, q_N\}$ is either the probability distribution on $N$ independent
Bernoulli variables $\{X_1, \dots, X_N\}$ or a vector in $[0,1]^N$.

• If $f$ is a function and $X$ a random variable, $E_Q(f(X))$ is the expected value
of $f(X)$ under probability $Q$.

• If $X = \{X_1, \dots, X_N\}$ are independent Bernoulli variables under $Q$,
$E_{Q_{\setminus i}}(f(X) \mid X_i = a)$ is the expected value of $f(X)$, given $X_i = a$.

• If $\{X^t\}$ is a convergent sequence in $\mathbb{R}^N$, its unique limit point is denoted
by $\bar{X}$.

2. Variational inference problem

We first recall the general formulation of KL divergence minimization problems as they 
appear in variational inference problems. 

We assume that we are working with $N$ random variables $\{X_1, \dots, X_N\}$ whose posterior
distribution is taken from the exponential family (as in Winn and Bishop (2005) and Bishop
(2008)), with marginal priors $p_i^0(X_i)$. The energy function is denoted by $\Psi$. We make the
important assumption that it is bounded.

$$P(X) = \frac{1}{Z} \, e^{-\Psi(X)} \prod_i p_i^0(X_i)$$

where $Z$ is a normalisation factor.

Following the traditional mean field approach (Bishop, 2008; Koller and Friedman, 2009),
we are now trying to get a tractable representation $Q(X)$ of this probability distribution
$P(X)$. By tractable, we mean a distribution that we can easily manipulate, sample from,
and compute expectations under. We therefore approximate $P$ by $Q$ among the product
distributions $Q(X) = \prod_i Q_i(X_i)$. This means that we look for the $Q$ which is closest to $P$ in
the sense of the KL divergence $KL(Q\|P)$.


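For the sake of simplicity, it is assumed in the following that the $X_i$ are Bernoulli variables
(i.e. $\{X_1, \dots, X_N\} \in \{0,1\}^N$). However, we could easily work with non-binary random
variables.

Therefore, writing $q_i := Q_i(X_i = 1)$ and $p_i := p_i^0(X_i = 1)$, the approximating distribution
can be written as:

$$Q(X) = \prod_i Q_i(X_i) = \prod_i q_i^{X_i} (1 - q_i)^{1 - X_i}$$

The general form of the KL divergence is:

$$KL(Q\|P) = \sum_{x \in \{0,1\}^N} Q(x) \log \frac{Q(x)}{P(x)} \tag{1}$$

which we can rewrite, up to the additive constant $\log Z$, as the sum of a multivariate
polynomial and univariate convex functions:

$$KL(Q\|P) = \sum_{x \in \{0,1\}^N} \Psi(x) \prod_i q_i^{x_i} (1 - q_i)^{1 - x_i} + \sum_i f_i(q_i) + \log Z \tag{2}$$

where:

$$f_i(q_i) = q_i \log \frac{q_i}{p_i} + (1 - q_i) \log \frac{1 - q_i}{1 - p_i} \tag{3}$$

We introduce the functions $G(\{q_1, \dots, q_N\})$ and $\Omega(\{q_1, \dots, q_N\})$ such that:

$$G(\{q_1, \dots, q_N\}) = KL(\{q_1, \dots, q_N\}\|P) \tag{4}$$

$$= \sum_{x \in \{0,1\}^N} \Psi(x) \prod_i q_i^{x_i} (1 - q_i)^{1 - x_i} + \sum_i f_i(q_i) + \log Z \tag{5}$$

$$= \Omega(\{q_1, \dots, q_N\}) + \sum_i f_i(q_i) + \log Z \tag{6}$$

where:

Definition 1

$$\Omega(\{q_1, \dots, q_N\}) := E_Q(\Psi(X)) \tag{7}$$

The KL divergence minimization is thus the following:

$$\operatorname*{argmin}_{\{q_1, \dots, q_N\}} G(\{q_1, \dots, q_N\}) \tag{8}$$
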


This problem is obviously non-convex as it involves a sum of multiple products. Therefore, 
finding a global minimum can be cumbersome in large dimensions. In the next section, an 
algorithm which yields a sequence converging to a first-order critical point is introduced. An 
estimate of the local convergence rate can also be derived, based on Attouch et al. (2013). 
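
As a sanity check on this formulation, the following minimal Python sketch compares, on a tiny model, the KL divergence computed from its definition with the sum of the expected energy, the univariate terms and log Z. It assumes a target of the form P(x) proportional to exp(-Psi(x)) times the product of Bernoulli priors. The energy `toy_psi`, the numerical values and all function names are illustrative assumptions, not taken from the paper, and the enumeration is exponential in N, so it only makes sense for very small examples.

```python
import itertools
import math


def product_prob(q, x):
    """Probability of state x under independent Bernoullis with P(x_i = 1) = q[i]."""
    return math.prod(qi if xi else 1.0 - qi for qi, xi in zip(q, x))


def kl_direct(q, p, psi):
    """KL(Q || P) from the definition, with P(x) = exp(-psi(x)) * prod_i p_i(x_i) / Z."""
    states = list(itertools.product((0, 1), repeat=len(q)))
    unnorm = [math.exp(-psi(x)) * product_prob(p, x) for x in states]
    z = sum(unnorm)
    total = 0.0
    for x, u in zip(states, unnorm):
        qx = product_prob(q, x)
        if qx > 0.0:
            total += qx * math.log(qx * z / u)
    return total


def kl_decomposed(q, p, psi):
    """Expected energy Omega(q), plus the univariate terms f_i(q_i), plus log Z."""
    states = list(itertools.product((0, 1), repeat=len(q)))
    z = sum(math.exp(-psi(x)) * product_prob(p, x) for x in states)
    omega = sum(psi(x) * product_prob(q, x) for x in states)
    univariate = sum(
        qi * math.log(qi / pi) + (1.0 - qi) * math.log((1.0 - qi) / (1.0 - pi))
        for qi, pi in zip(q, p)
    )
    return omega + univariate + math.log(z)


def toy_psi(x):
    """Hypothetical bounded energy: penalises disagreement between neighbours."""
    return 0.8 * (x[0] != x[1]) + 0.5 * (x[1] != x[2])


q_example = [0.2, 0.7, 0.4]
p_example = [0.5, 0.3, 0.6]
gap = abs(kl_direct(q_example, p_example, toy_psi)
          - kl_decomposed(q_example, p_example, toy_psi))
```

The two evaluations should agree up to floating point error, which is what makes the coordinate-wise analysis of G tractable: Omega is linear in each $q_i$ taken separately, and each $f_i$ depends on a single coordinate.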

3. Proximal alternate minimisation algorithm 

In this section, we derive a tractable algorithm that converges to a first order stationary
point of the problem of Eq. 8, with convergence guarantees and a provable asymptotic
convergence rate.

Although the alternate minimisation algorithm produces a decreasing sequence of objective
values, there is a priori no guarantee that the variable sequence actually converges, as
demonstrated by Absil et al. (2005). Powell (1973) shows examples of minimisation problems
for which a coordinate descent method fails to converge.

However, we show in this paper that, by adding a proximal regularisation, we can use
the Kurdyka-Lojasiewicz inequality and recent work by Attouch and Bolte to prove convergence.
The specific form of the penalty term lets us retain the ability to compute the
updates in closed form in the case of variational inference.

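Regularisation We use a regularisation function which is the KL divergence between
the one dimensional iterates. During the iterations, this proximal function $l(q, q_0)$
penalises the variables which are too different from their previous value.

$$l(q, q_0) = KL(B(q)\|B(q_0)) \tag{9}$$

$$= q \log \frac{q}{q_0} + (1 - q) \log \frac{1 - q}{1 - q_0} \tag{10}$$

$$L(Q, Q_0) = \sum_i l(q_i, q_{0,i}) = KL(B(Q)\|B(Q_0)) \tag{11}$$

where $B(q)$ denotes the Bernoulli distribution with parameter $q$ and $B(Q)$ the product of the $B(q_i)$.

Given $q_0$, $l(q, q_0)$ is strongly convex with regard to $q$, and it is non-negative and continuous
on $]0,1[^2$. Its minimum is 0, attained at $q = q_0$.

It is worth noting that the derivative of this function with respect to the minimisation variable $x$ is
simple as well and can be written as follows:

$$l'(x, x_0) = \frac{\partial l}{\partial x}(x, x_0) = \log\left(\frac{x}{1 - x}\right) - \log\left(\frac{x_0}{1 - x_0}\right) \tag{12}$$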

Remark 2 If X is not binary, we simply replace $l$ by the KL divergence between discrete
random variables.
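
For the binary case, the regulariser and its derivative are short enough to write down directly. The sketch below is a minimal illustration, with function names chosen here rather than taken from the paper: it implements the Bernoulli KL penalty and the derivative of Eq. 12, and compares the latter with a central finite difference.

```python
import math


def bernoulli_kl(q, q0):
    """l(q, q0) = KL(B(q) || B(q0)), for q and q0 strictly inside (0, 1)."""
    return q * math.log(q / q0) + (1.0 - q) * math.log((1.0 - q) / (1.0 - q0))


def bernoulli_kl_grad(q, q0):
    """Derivative of l(q, q0) with respect to its first argument (Eq. 12)."""
    return math.log(q / (1.0 - q)) - math.log(q0 / (1.0 - q0))


def central_difference(f, x, h=1e-6):
    """Second-order finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


analytic = bernoulli_kl_grad(0.3, 0.6)
numeric = central_difference(lambda q: bernoulli_kl(q, 0.6), 0.3)
```

The derivative is a difference of log-odds, which is what later allows the penalised coordinate update to be solved in closed form.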

Proximal alternate minimization procedure We loop through the variables,
minimizing the objective over one variable at a time, the others staying fixed (see Algorithm 1),
with the following update rule:

$$q_i^{t+1} = \operatorname*{argmin}_{q} \left\{ G(\{q_1^{t+1}, \dots, q_{i-1}^{t+1}, q, q_{i+1}^{t}, \dots, q_N^{t}\}) + \lambda \, l(q, q_i^t) \right\} \tag{13}$$

Input: A prior distribution $\{p_1, \dots, p_N\}$ and a KL function $G$
Output: A MF distribution $\{q_1, \dots, q_N\}$

Initialisation to the prior:

    $\{q_1, \dots, q_N\} \leftarrow \{p_1, \dots, p_N\}$

Loop until convergence:
while $\|\nabla G(\{q_1, \dots, q_N\})\| > \epsilon$ do
    for $i$ in $\{1, \dots, N\}$ do
        $q_i \leftarrow \operatorname*{argmin}_{q} \; G(\{q_1, \dots, q, \dots, q_N\}) + \lambda \, l(q, q_i)$
    end
end

return $\{q_1, \dots, q_N\}$

Algorithm 1: KL Proximal alternate minimisation

The main advantage of our penalization method (e.g. $l(q, q_i^t)$) over the quadratic one
(e.g. $\|q - q_i^t\|^2$ as in Attouch et al. (2010)) is that the update is computed in closed form.
Indeed, the minimization objective is differentiable and convex in $q$. Therefore, the first order
one dimensional optimality condition gives:

$$q_i^{t+1} = \frac{1}{1 + \exp\left( \frac{1}{1 + \lambda} \left[ E_{Q^t_{\setminus i}}(\Psi(X) \mid X_i = 1) - E_{Q^t_{\setminus i}}(\Psi(X) \mid X_i = 0) + \log\frac{1 - p_i}{p_i} + \lambda \log\frac{1 - q_i^t}{q_i^t} \right] \right)} \tag{14}$$

With the notation:

$$Q^t_{\setminus i} = \{q_1^{t+1}, \dots, q_{i-1}^{t+1}, q_{i+1}^{t}, \dots, q_N^{t}\}$$

Remark 3 If X is not binary, a similar closed form minimum is easily obtained by introducing
a Lagrange multiplier as in Beal (2003).
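
To make the procedure concrete, here is a minimal Python sketch of the proximal sweep for binary variables. It assumes the closed-form update is a logistic function of the conditional energy gap, the prior log-odds and the previous log-odds weighted by the proximal parameter, all divided by one plus that parameter. Conditional expectations are computed by brute-force enumeration, which is only viable for very small N. The energy `toy_psi`, the parameter values and every function name are hypothetical choices made for illustration.

```python
import itertools
import math


def conditional_energy(q, i, value, psi):
    """E_Q[psi(X) | X_i = value], by enumerating the other coordinates."""
    others = [j for j in range(len(q)) if j != i]
    total = 0.0
    for bits in itertools.product((0, 1), repeat=len(others)):
        x = [0] * len(q)
        x[i] = value
        weight = 1.0
        for j, b in zip(others, bits):
            x[j] = b
            weight *= q[j] if b else 1.0 - q[j]
        total += weight * psi(x)
    return total


def log_odds_against(u):
    """log((1 - u) / u)."""
    return math.log((1.0 - u) / u)


def energy_gap(q, i, psi):
    return conditional_energy(q, i, 1, psi) - conditional_energy(q, i, 0, psi)


def proximal_update(q, i, p, psi, lam):
    """Closed-form penalised coordinate update; lam = 0 gives the classical update."""
    arg = (energy_gap(q, i, psi) + log_odds_against(p[i])
           + lam * log_odds_against(q[i])) / (1.0 + lam)
    return 1.0 / (1.0 + math.exp(arg))


def gradient_norm(q, p, psi):
    """Euclidean norm of the gradient of the KL objective with respect to q."""
    grad = [energy_gap(q, i, psi) + log_odds_against(p[i]) - log_odds_against(q[i])
            for i in range(len(q))]
    return math.sqrt(sum(g * g for g in grad))


def proximal_mean_field(p, psi, lam=0.1, eps=1e-8, max_sweeps=10000):
    """Sweep over coordinates until the gradient norm drops below eps."""
    q = list(p)  # initialisation to the prior
    for _ in range(max_sweeps):
        if gradient_norm(q, p, psi) <= eps:
            break
        for i in range(len(q)):
            q[i] = proximal_update(q, i, p, psi, lam)  # q[:i] is already updated
    return q


def toy_psi(x):
    """Hypothetical bounded energy on three binary variables."""
    return 0.8 * (x[0] != x[1]) + 0.5 * (x[1] != x[2])


prior = [0.6, 0.4, 0.5]
q_proximal = proximal_mean_field(prior, toy_psi, lam=0.1)
q_classical = proximal_mean_field(prior, toy_psi, lam=0.0)
```

On this weakly coupled toy model both runs reach the same stationary point; the proximal parameter only changes the path taken, damping each coordinate move towards its previous value.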

4. Convergence of the algorithm

Our analysis is along the lines of Attouch et al. (2010), using the Kurdyka-Lojasiewicz 
inequality as the key tool in our proof. 

4.1. A general convergence result 

Definition 4 (Kurdyka-Lojasiewicz Property) A differentiable function $f$ is said to
have the Kurdyka-Lojasiewicz property at $\bar{x}$ if there exist $\eta > 0$, a neighborhood $U$ of $\bar{x}$
and a continuous concave function $\phi : [0, \eta) \to \mathbb{R}_+$, such that:

- $\phi(0) = 0$

- $\phi$ is $C^1$ on $(0, \eta)$

- $\forall s \in (0, \eta), \ \phi'(s) > 0$.

- $\forall x \in U \cap [f(\bar{x}) < f < f(\bar{x}) + \eta]$, the following inequality, called the
Kurdyka-Lojasiewicz inequality, holds:

$$\phi'(f(x) - f(\bar{x})) \, \|\nabla f(x)\| \ge 1 \tag{15}$$


Lemma 5 Let $F$ be any differentiable function from $\mathbb{R}^N$ to $\mathbb{R}$ and $\{X^t\}$ a bounded sequence
which has the three following properties:

(i) Sufficient decrease:

$\exists \lambda > 0$ such that, $\forall t \ge 0$,

$$F(X^{t+1}) + \frac{\lambda}{2} \|X^{t+1} - X^t\|^2 \le F(X^t) \tag{16}$$

(ii) Gradient bound:

$\exists C > 0$ such that, $\forall t \ge 0$,

$$\|\nabla F(X^{t+1})\| \le C \|X^{t+1} - X^t\| \tag{17}$$

(iii) The function $F$ has the Kurdyka-Lojasiewicz property at all its critical points, with
$\phi(s) = c s^{1-\theta}$ and $\theta \in ]0, 1[$.

Then, the sequence converges to a stationary point of $F$ that we denote $\bar{X}$. Moreover,
the following convergence rates apply (depending on $\theta$).

(a) If $\theta \in ]0, \frac{1}{2}]$, then $\exists A > 0$, $\exists r \in ]0, 1[$ such that:

$$\|X^t - \bar{X}\| \le A r^t \tag{18}$$

(b) If $\theta \in ]\frac{1}{2}, 1[$, then $\exists A > 0$ such that:

$$\|X^t - \bar{X}\| \le A \, t^{-(1-\theta)/(2\theta-1)} \tag{19}$$

Proof The proof of the previous Lemma follows from the recent work of Attouch and
Bolte. There is no explicit statement of the asymptotic convergence rates in Attouch et al.
(2013); however, one can closely follow Attouch et al. (2010). ■

4.2. Properties

Kurdyka-Lojasiewicz

Proposition 6 The function $G$ defined in Eq. 4 satisfies the Kurdyka-Lojasiewicz property at
all its critical points with a function $\phi(s) = c s^{1-\theta}$ where $\theta \in [\frac{1}{2}, 1[$.

Let us denote by $U$ and $\eta$ the associated objects in Definition 4.

Proof Lojasiewicz (1965, 1984) showed that any real analytic function has
the Kurdyka-Lojasiewicz property with $\phi(s) = c s^{1-\theta}$ for some $\theta \in [\frac{1}{2}, 1[$.

Our function $G$ is real analytic on $]0,1[^N$, which completes the proof of Proposition 6. ■


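Lemma 7 The sequence $\{Q^t\}$ belongs to a compact set $\mathcal{S} \subset ]0,1[^N$. Let us define:

$$\mathcal{S} := \prod_i [q_i^{\min}, q_i^{\max}]$$

Proof We know that $\Psi$ is bounded. Let us define $\Psi_{\min}$ and $\Psi_{\max}$ such that:

$$\forall x \in \{0,1\}^N, \quad \Psi_{\min} \le \Psi(x) \le \Psi_{\max}$$

$$\forall i \in \{1, \dots, N\}, \quad q_i^{\min} = \frac{1}{1 + \exp\left( \Psi_{\max} - \Psi_{\min} + \log\frac{1 - p_i}{p_i} \right)} \tag{20}$$

$$\forall i \in \{1, \dots, N\}, \quad q_i^{\max} = \frac{1}{1 + \exp\left( \Psi_{\min} - \Psi_{\max} + \log\frac{1 - p_i}{p_i} \right)} \tag{21}$$

Then, if we assume that $q_i^t \in [q_i^{\min}, q_i^{\max}]$, using (14), (20), (21), we can write the
following:

$$\begin{aligned}
\log\frac{1 - q_i^{t+1}}{q_i^{t+1}}
&= \frac{1}{1 + \lambda} \left[ E_{Q^t_{\setminus i}}(\Psi(X) \mid X_i = 1) - E_{Q^t_{\setminus i}}(\Psi(X) \mid X_i = 0) + \log\frac{1 - p_i}{p_i} + \lambda \log\frac{1 - q_i^t}{q_i^t} \right] \\
&\le \frac{1}{1 + \lambda} \left[ \Psi_{\max} - \Psi_{\min} + \log\frac{1 - p_i}{p_i} + \lambda \log\frac{1 - q_i^t}{q_i^t} \right] \\
&\le \frac{1}{1 + \lambda} \left[ \log\frac{1 - q_i^{\min}}{q_i^{\min}} + \lambda \log\frac{1 - q_i^{\min}}{q_i^{\min}} \right] \\
&= \log\frac{1 - q_i^{\min}}{q_i^{\min}}
\end{aligned}$$

By monotonicity of $q \mapsto \log\frac{1 - q}{q}$, and conversely for the upper bound, we conclude that
$q_i^{t+1} \in [q_i^{\min}, q_i^{\max}]$. Therefore, by induction, as long as $Q^0 \in \mathcal{S}$
($Q^0 = \{p_1, \dots, p_N\}$ for instance), $Q^t \in \mathcal{S}$ for all $t$.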
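
The invariance of the box can also be probed numerically. The following Python sketch is an illustration under stated assumptions: it uses the bounds of Eqs. 20 and 21, and it replaces the difference of conditional expectations in the update by an arbitrary number `delta` whose absolute value is at most the range of the energy, which is the only property of that difference used in the proof. All names and numerical values are hypothetical.

```python
import math
import random


def box_bounds(p_i, psi_min, psi_max):
    """Interval [q_min, q_max] of Eqs. 20-21 for one coordinate with prior p_i."""
    prior_term = math.log((1.0 - p_i) / p_i)
    q_min = 1.0 / (1.0 + math.exp(psi_max - psi_min + prior_term))
    q_max = 1.0 / (1.0 + math.exp(psi_min - psi_max + prior_term))
    return q_min, q_max


def update(delta, p_i, q_old, lam):
    """Penalised closed-form update, with delta standing for the conditional energy gap."""
    arg = (delta + math.log((1.0 - p_i) / p_i)
           + lam * math.log((1.0 - q_old) / q_old)) / (1.0 + lam)
    return 1.0 / (1.0 + math.exp(arg))


def stays_in_box(p_i, psi_min, psi_max, lam, trials=1000, seed=0):
    """Random test: starting anywhere in the box, one update never leaves it."""
    rng = random.Random(seed)
    q_min, q_max = box_bounds(p_i, psi_min, psi_max)
    span = psi_max - psi_min
    for _ in range(trials):
        q_old = rng.uniform(q_min, q_max)
        delta = rng.uniform(-span, span)
        q_new = update(delta, p_i, q_old, lam)
        if not (q_min - 1e-12 <= q_new <= q_max + 1e-12):
            return False
    return True


low, high = box_bounds(0.3, -1.0, 2.0)
```

Since the prior value itself always lies between the two bounds, initialising at the prior places the first iterate inside the box, in line with the induction above.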

Sufficient decrease

Lemma 8 The penalization $l$ (Equation (9)) is 1-strongly convex on $]0,1[$.

Therefore, for all $x$ and $x_0$ in $]0,1[$:

$$\frac{1}{2} \|x - x_0\|^2 \le l(x, x_0) \tag{22}$$

Proof

By a simple differentiation of $l$, we get:

$$\frac{\partial^2 l(x, x_0)}{\partial x^2} = \frac{1}{x} + \frac{1}{1 - x} \ge 1$$

Then, by definition of strong convexity, combined with $l(x_0, x_0) = 0$ and $l'(x_0, x_0) = 0$,
we get the second part of the Lemma. ■
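
The step from the second derivative to the quadratic lower bound can be spelled out with Taylor's formula with integral remainder. The following is a sketch of that standard argument, written with the symbols of Lemma 8:

```latex
l(x, x_0)
  = \underbrace{l(x_0, x_0)}_{=\,0}
  + \underbrace{l'(x_0, x_0)}_{=\,0}\,(x - x_0)
  + \int_{x_0}^{x} (x - s)\, \frac{\partial^2 l}{\partial s^2}(s, x_0)\, ds
  \;\ge\; \int_{x_0}^{x} (x - s)\, ds
  = \frac{1}{2}\,(x - x_0)^2 .
```

Because $\frac{1}{s} + \frac{1}{1-s} = \frac{1}{s(1-s)} \ge 4$ on $]0,1[$, the same argument even yields the stronger bound $2 (x - x_0)^2 \le l(x, x_0)$; the constant $\frac{1}{2}$ is all that is needed in what follows.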


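Proposition 9 Our alternate minimization algorithm has the following sufficient decrease
property.

For all indices $t \ge 1$,

$$G(Q^{t+1}) + \frac{\lambda}{2} \|Q^{t+1} - Q^t\|^2 \le G(Q^t)$$

Proof

An elementary induction gives, for each step $i$:

$$G(\{q_1^{t+1}, \dots, q_{i-1}^{t+1}, q_i^{t+1}, q_{i+1}^{t}, \dots, q_N^{t}\}) + \lambda \, l(q_i^{t+1}, q_i^t) \le G(\{q_1^{t+1}, \dots, q_{i-1}^{t+1}, q_i^{t}, q_{i+1}^{t}, \dots, q_N^{t}\}) \tag{23}$$

Therefore, using the same equations for $i \in \{1, \dots, N\}$ and writing
$L(Q, Q') := \sum_i l(q_i, q'_i)$, it easily follows that:

$$G(Q^{t+1}) + \lambda L(Q^{t+1}, Q^t) \le G(Q^t)$$

And by the strong convexity property of Lemma 8, we get:

$$G(Q^{t+1}) + \frac{\lambda}{2} \|Q^{t+1} - Q^t\|^2 \le G(Q^t) \tag{24}$$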
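
The chaining argument behind this proof can be written out explicitly. The sketch below introduces the shorthand $Q^{t,i}$ for the iterate after the first $i$ coordinate updates of sweep $t$; this notation is an assumption made here for readability and does not appear in the paper.

```latex
% Intermediate iterates within one sweep (shorthand assumed for this sketch):
Q^{t,i} := \{q_1^{t+1}, \dots, q_i^{t+1}, q_{i+1}^{t}, \dots, q_N^{t}\},
\qquad Q^{t,0} = Q^{t}, \qquad Q^{t,N} = Q^{t+1}.

% q_i^{t+1} minimises the penalised objective of Eq. 13; comparing with the
% candidate q = q_i^t, for which l(q_i^t, q_i^t) = 0, gives for i = 1, ..., N:
G(Q^{t,i}) + \lambda\, l(q_i^{t+1}, q_i^{t}) \le G(Q^{t,i-1}).

% Summing over i, the intermediate values of G cancel:
G(Q^{t+1}) + \lambda \sum_{i=1}^{N} l(q_i^{t+1}, q_i^{t}) \le G(Q^{t}).

% Applying the quadratic lower bound of Eq. 22 to each term of the sum:
G(Q^{t+1}) + \frac{\lambda}{2}\, \|Q^{t+1} - Q^{t}\|^2 \le G(Q^{t}).
```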


Gradient bound 
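Lemma 10

$\Omega$, defined in Definition 1, is $K_\Omega$-Lipschitz with $K_\Omega = \Psi_{\max} \sqrt{N}$.

Proof For any $i$ in $\{1, \dots, N\}$:

$$\left| \frac{\partial \Omega}{\partial q_i}(Q) \right| = \left| E_{Q_{\setminus i}}(\Psi(X) \mid X_i = 1) - E_{Q_{\setminus i}}(\Psi(X) \mid X_i = 0) \right| \le \Psi_{\max}$$

Therefore, using the classical inequality between the $L_2$ and $L_\infty$ norms:

$$\|\nabla \Omega(Q)\| \le \Psi_{\max} \sqrt{N}$$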


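Lemma 11

There exists a positive constant $K_l$ such that for any $Q$ and $Q'$ in the compact set $\mathcal{S}$ of Lemma 7:

$$\forall i \in \{1, \dots, N\}, \quad |l'(q_i, q'_i)| \le 2 K_l \, |q_i - q'_i| \tag{25}$$

Proof For any $i$ in $\{1, \dots, N\}$, the function $x \mapsto \log(x)$ is Lipschitz continuous on
$[q_i^{\min}, q_i^{\max}]$ with Lipschitz constant $\frac{1}{q_i^{\min}}$. And the function $x \mapsto \log(1 - x)$ is
Lipschitz continuous on $[q_i^{\min}, q_i^{\max}]$ with Lipschitz constant $\frac{1}{1 - q_i^{\max}}$.

Therefore, according to Eq. 12:

$$\forall x \in [q_i^{\min}, q_i^{\max}], \ \forall x_0 \in [q_i^{\min}, q_i^{\max}], \quad |l'(x, x_0)| \le \left( \frac{1}{q_i^{\min}} + \frac{1}{1 - q_i^{\max}} \right) |x - x_0|$$

Therefore, if we simply set $K_l$ such that
$K_l = \max_{i \in \{1, \dots, N\}} \max\left( \frac{1}{q_i^{\min}}, \frac{1}{1 - q_i^{\max}} \right)$, Eq. 25
follows directly.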
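
A quick numerical check of this Lipschitz bound is sketched below in Python. It assumes the constant is the largest of the reciprocals $1/q_i^{\min}$ and $1/(1 - q_i^{\max})$ over all coordinates, and that the derivative of the penalty is the difference of log-odds of its two arguments. The boxes and function names are hypothetical values chosen for illustration.

```python
import math
import random


def kl_derivative(x, x0):
    """Derivative of the Bernoulli KL penalty l(x, x0) with respect to x."""
    return math.log(x / (1.0 - x)) - math.log(x0 / (1.0 - x0))


def lipschitz_constant(boxes):
    """K_l: largest of 1/q_min and 1/(1 - q_max) over all coordinate intervals."""
    return max(max(1.0 / lo, 1.0 / (1.0 - hi)) for lo, hi in boxes)


def bound_holds(boxes, trials=1000, seed=0):
    """Random test of |l'(x, x0)| <= 2 * K_l * |x - x0| inside each interval."""
    rng = random.Random(seed)
    k_l = lipschitz_constant(boxes)
    for lo, hi in boxes:
        for _ in range(trials):
            x, x0 = rng.uniform(lo, hi), rng.uniform(lo, hi)
            if abs(kl_derivative(x, x0)) > 2.0 * k_l * abs(x - x0) + 1e-12:
                return False
    return True


example_boxes = [(0.05, 0.9), (0.2, 0.6)]
k_l_example = lipschitz_constant(example_boxes)
```

The constant degrades as an interval approaches 0 or 1, which is why the compactness of the set containing the iterates matters for the gradient bound.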


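Lemma 12 For any index $n \ge 1$, the following bound on the gradient of $G$ holds:

$$\|\nabla G(Q^n)\| \le (2 \lambda K_l + \sqrt{N} K_\Omega) \|Q^n - Q^{n-1}\| \tag{26}$$

Proof Let us choose $n \ge 1$. For any $i$, from the first order minimization condition in
Eq. 13, we know that:

$$0 = \lambda \, l'(q_i^n, q_i^{n-1}) + \frac{\partial G}{\partial q_i}(\{q_1^n, \dots, q_{i-1}^n, q_i^n, q_{i+1}^{n-1}, \dots, q_N^{n-1}\})$$

which we can rewrite, using the decomposition of $G$ into $\Omega$ and the univariate functions $f_i$, as:

$$\begin{aligned}
0 &= \lambda \, l'(q_i^n, q_i^{n-1}) + \frac{\partial \Omega}{\partial q_i}(\{q_1^n, \dots, q_i^n, q_{i+1}^{n-1}, \dots, q_N^{n-1}\}) + f_i'(q_i^n) \\
&= \lambda \, l'(q_i^n, q_i^{n-1}) + \frac{\partial \Omega}{\partial q_i}(\{q_1^n, \dots, q_i^n, q_{i+1}^{n-1}, \dots, q_N^{n-1}\}) - \frac{\partial \Omega}{\partial q_i}(\{q_1^n, \dots, q_N^n\}) \\
&\quad + \frac{\partial \Omega}{\partial q_i}(\{q_1^n, \dots, q_N^n\}) + f_i'(q_i^n)
\end{aligned} \tag{27}$$

where the last line is $\frac{\partial G}{\partial q_i}(Q^n)$.

Using equation (25), and the Lipschitz constant of $\Omega$ (see Lemma 10), we get:

$$\begin{aligned}
\left| \frac{\partial G}{\partial q_i}(Q^n) \right| &\le 2 \lambda K_l \, |q_i^n - q_i^{n-1}| + K_\Omega \, \|Q^n - \{q_1^n, \dots, q_i^n, q_{i+1}^{n-1}, \dots, q_N^{n-1}\}\| \\
&\le 2 \lambda K_l \, |q_i^n - q_i^{n-1}| + K_\Omega \, \|Q^n - Q^{n-1}\|
\end{aligned} \tag{28}$$

Combining equation 28 for $i \in \{1, \dots, N\}$ we get:

$$\|\nabla G(Q^n)\| \le (2 \lambda K_l + \sqrt{N} K_\Omega) \|Q^n - Q^{n-1}\| \tag{29}$$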


4.3. Convergence 

We showed in the previous section (Lemma 12, Proposition 9 and Proposition 6) that the 
sequence generated by our new minimization procedure has the three sufficient properties for 
convergence, as shown in Lemma 5. Therefore, according to Lemma 5 (or Attouch and Bolte 
(2009)), the main Theorem of this paper can be stated as follows. 

Theorem 13 (Convergence) The sequence $\{Q^t\}$ generated by the proximal alternate minimization
procedure described in Algorithm 1 converges to a critical point of $G$, denoted $\bar{Q}$.

Corollary 14 The following asymptotic convergence rates hold:

We recall that $\theta$ is the exponent of the $\phi$ function in the Kurdyka-Lojasiewicz inequality,
such that $\phi(s) = c s^{1-\theta}$.

(i) If $\theta \in ]0, \frac{1}{2}]$, then $\exists C > 0$, $\exists r \in ]0, 1[$ such that:

$$\|Q^t - \bar{Q}\| \le C r^t \tag{30}$$

(ii) If $\theta \in ]\frac{1}{2}, 1[$, then $\exists C > 0$ such that:

$$\|Q^t - \bar{Q}\| \le C \, t^{-(1-\theta)/(2\theta-1)} \tag{31}$$

If we make the standard SSOC assumption on $G$ (the Hessian is positive definite at all
the local minima), then we can show that the convergence rate toward the local minimum is
linear, as in Equation 30, with $\theta = 1/2$.


Proof [Proof of the corollary] The first part of the corollary is a direct consequence of
Lemma 5.

Let us now assume that the Hessian matrix is positive definite at all the local minima
(SSOC assumption). We denote by $\mu_1$ and $\mu_2$ the highest and lowest eigenvalues of the
Hessian in a neighborhood of a local minimum $\bar{Q}$. $\mu_1$ and $\mu_2$ are both positive by SSOC
and continuity of the Hessian. Let us then write the Taylor formula for $G$ and $\nabla G$ in the
neighborhood of $\bar{Q}$. It follows that there exists a neighborhood $U$ of $\bar{Q}$ such that, for all $Q \in U$:

$$|G(Q) - G(\bar{Q})| \le \mu_1 \|Q - \bar{Q}\|^2 \tag{32}$$

and

$$\|\nabla G(Q)\| \ge \mu_2 \|Q - \bar{Q}\| \tag{33}$$

This shows that $G$ satisfies a Kurdyka-Lojasiewicz inequality at all its minimal points, with
$\phi(s) = c\sqrt{s}$. Therefore, if $\{Q^t\}$ converges toward a minimal point which satisfies the SSOC, the
convergence rate is linear with $\theta = 1/2$. ■
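
The passage from the two Taylor bounds to the Kurdyka-Lojasiewicz inequality can be made explicit. The following sketch assumes $\phi(s) = c\sqrt{s}$, so that $\phi'(s) = \frac{c}{2\sqrt{s}}$, and considers points $Q$ of the neighborhood with $G(Q) > G(\bar{Q})$:

```latex
\phi'\bigl(G(Q) - G(\bar{Q})\bigr)\, \|\nabla G(Q)\|
  = \frac{c\, \|\nabla G(Q)\|}{2 \sqrt{G(Q) - G(\bar{Q})}}
  \;\ge\; \frac{c\, \mu_2\, \|Q - \bar{Q}\|}{2 \sqrt{\mu_1}\, \|Q - \bar{Q}\|}
  = \frac{c\, \mu_2}{2 \sqrt{\mu_1}} .
```

The right-hand side is at least 1 as soon as $c \ge 2\sqrt{\mu_1}/\mu_2$, and $c\sqrt{s} = c\, s^{1-\theta}$ with $\theta = 1/2$, which places the sequence in the linear-rate case of the corollary.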


5. Conclusion 

Although the convergence of fixed point iterations schemes for mean field minimization is 
often taken for granted, no formal proof exists. In this paper, we have proposed a slightly 
modified scheme that is provably convergent. This addresses a major conceptual weakness 
of one of the most important algorithms used by the Machine Learning community. 
Interestingly, our regularisation can be chosen as small as needed through the parameter
$\lambda$. Therefore, our algorithm can be arbitrarily similar to the classical minimisation, while
guaranteeing convergence.

In future work, we will explore the practical applications of our scheme. We will look
for examples where it accelerates convergence. It may prevent infinite, but also temporary,
oscillations between equivalent solutions of a learning optimisation problem.